Phase 4: EDA

Summary



**Importing the dataset and important libraries**

Understanding the dataset in depth

We will start with a basic exploration of the dataset before looking for trends. This will give us a better understanding of the data and might even surface something entirely new.


---> We will start by finding the total number of users. We performed this operation earlier, but let's go over it again.

There are exactly 206,209 users in this dataset.
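The count above can be sketched with a simple `nunique` on the orders table. This is an illustrative sketch: the tiny frame below stands in for the real `orders.csv`, assuming it has a `user_id` column as in the Instacart schema.

```python
import pandas as pd

# Tiny stand-in for orders.csv (assumed schema: order_id, user_id)
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "user_id":  [10, 10, 11, 12, 12],
})

# Count distinct users; on the full dataset this yields 206209
n_users = orders["user_id"].nunique()
print(n_users)
```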


---> Moving on, let's find the range of the total number of orders placed by each user.

The distribution of orders per user is heavily skewed, with most users placing only a few orders; the average is approximately 16 orders. We now know that this dataset is not dominated by a small number of users [which was our earlier speculation].
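A minimal sketch of the per-user order count, assuming the `orders` table numbers each user's orders sequentially in an `order_number` column (as Instacart's schema does), so the per-user maximum equals the total orders placed:

```python
import pandas as pd

# Tiny stand-in for orders.csv
orders = pd.DataFrame({
    "user_id":      [10, 10, 10, 11, 11, 12],
    "order_number": [1, 2, 3, 1, 2, 1],
})

# Max order_number per user = total orders that user placed
orders_per_user = orders.groupby("user_id")["order_number"].max()

# describe() exposes min/mean/max and hence the skew
print(orders_per_user.describe())
```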


---> Now we will find:

Aisle count for each Department

Product count for each Aisle

Now that we have merged all the tables, let's check for any null values.

No null values, nice! Let's move on to the main task.
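The merge-and-check step can be sketched as two left joins followed by `isnull().sum()`. The tiny frames below are illustrative stand-ins for `products.csv`, `aisles.csv`, and `departments.csv`:

```python
import pandas as pd

# Stand-ins for the three lookup tables
products = pd.DataFrame({
    "product_id":    [1, 2],
    "product_name":  ["Chocolate Bar", "Banana"],
    "aisle_id":      [45, 24],
    "department_id": [19, 4],
})
aisles = pd.DataFrame({"aisle_id": [45, 24],
                       "aisle": ["candy chocolate", "fresh fruits"]})
departments = pd.DataFrame({"department_id": [19, 4],
                            "department": ["snacks", "produce"]})

# products -> aisles -> departments
merged = (products.merge(aisles, on="aisle_id", how="left")
                  .merge(departments, on="department_id", how="left"))

# All zeros means no null values after the merge
print(merged.isnull().sum())
```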


Aisle count for each department

Hmm... there is a department named "missing". Let's explore this issue in depth.

As we can see, both the department and aisle names are missing for these products. Let's check the total number of departments and aisles having "missing" as their name.

Speculation: There will be only one aisle and one department named "missing". We can say this with some confidence because during the Data Cleaning phase we checked the aisles.csv and departments.csv datasets for duplicate values and found none. Hence aisle_id 100 and department_id 21 are likely the only aisle and department named "missing".

Even though their aisle and department names are missing, that does not mean these products are absent from the transactional dataset. We already know that a few products are missing from the main transactional dataset. We will check later how many products belong to aisle 100.
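Isolating the products in the "missing" aisle/department can be sketched with a boolean mask. This assumes aisle_id 100 and department_id 21 are the "missing" entries, per the speculation above; the frame is an illustrative stand-in for the products table:

```python
import pandas as pd

# Stand-in for products.csv; "Mystery Item" sits in the missing aisle/dept
products = pd.DataFrame({
    "product_id":    [1, 2, 3],
    "product_name":  ["Banana", "Mystery Item", "Chocolate Bar"],
    "aisle_id":      [24, 100, 45],
    "department_id": [4, 21, 19],
})

# Products whose aisle and department are both "missing"
missing_mask = (products["aisle_id"] == 100) & (products["department_id"] == 21)
missing_products = products[missing_mask]
print(len(missing_products))
```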


Product count for each aisle

The "missing" aisle, i.e. aisle 100, has the highest number of products, followed by candy chocolate. From the top 15 aisles we can see that this online grocery store covers a wide variety of aisles, each carrying a large variety of products.

The 15 aisles with the fewest products. Apart from the first aisle, i.e. "bulk dried fruits vegetables", the rest still have a fair number of products associated with them.
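The per-aisle product count behind both views can be sketched as a merge plus a grouped count; `head(15)` and `tail(15)` give the top and bottom lists. The tiny frames are illustrative stand-ins:

```python
import pandas as pd

# Stand-ins for products.csv and aisles.csv
products = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "aisle_id":   [100, 100, 45, 24],
})
aisles = pd.DataFrame({
    "aisle_id": [100, 45, 24],
    "aisle":    ["missing", "candy chocolate", "fresh fruits"],
})

# Products per aisle, largest first
counts = (products.merge(aisles, on="aisle_id")
                  .groupby("aisle")["product_id"].count()
                  .sort_values(ascending=False))

print(counts.head(15))  # top 15; counts.tail(15) gives the bottom 15
```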

---> Finding missing products from the main transactional dataset [i.e. order_products__prior.csv]

Only 3 of these products belong to the department and aisle named "missing". We can also see that a total of 11 products never appear in the main transactional dataset. This dataset alone is not sufficient to explain why these products are absent; answering that will require further investigation with the stakeholders.
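Finding products that never appear in the transactional data is an anti-join, which can be sketched with `isin` and negation. The frames below are illustrative stand-ins for the products table and `order_products__prior.csv`:

```python
import pandas as pd

# Stand-ins: four catalogued products, but only 1 and 2 were ever ordered
products = pd.DataFrame({"product_id": [1, 2, 3, 4]})
order_products_prior = pd.DataFrame({
    "order_id":   [100, 100, 101],
    "product_id": [1, 2, 1],
})

# Anti-join: products with no row in the transactional data
absent = products[~products["product_id"].isin(order_products_prior["product_id"])]
print(absent["product_id"].tolist())
```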

**Questions to answer**

The following questions will surface the key information stored in this dataset. This information can be used by stakeholders or marketing teams in whatever way suits their needs.

  1. Each product's Frequency / Count.
  2. Analyzing purchase activity of customers for each day in a week.
  3. Analyzing purchase activity of customers for each hour in a day.
  4. Frequency / Count of reordered products.
  5. Identify potential products that are reordered after a gap of a few days.

--> Question 1: Each product's Frequency / Count.

We have already completed this process and saved the results in a CSV file. All we need to do is add product names, and our result will be ready.

We now know that 11 products are missing from the transactional dataset, and since we performed an outer join, there must be some null values.

It seems the frequency column has null values. Only the frequency column has nulls because every product_id is present in the products dataset, but not every product appears in the transactional dataset, so those products have no frequency.

The output above shows exactly the 11 products that were not present in the main transactional dataset, which is why their frequency is missing. We will fill the NaN values with zero.
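The outer join and zero-fill can be sketched as follows; the frames are illustrative stand-ins, with "Never Ordered" playing the role of the 11 absent products:

```python
import pandas as pd

# Per-product frequencies computed from the transactional data
freq = pd.DataFrame({"product_id": [1, 2], "frequency": [5, 3]})

# Full product catalogue, including a product with no transactions
products = pd.DataFrame({
    "product_id":   [1, 2, 3],
    "product_name": ["Banana", "Milk", "Never Ordered"],
})

# Outer join keeps catalogue rows with no frequency (NaN)
result = products.merge(freq, on="product_id", how="outer")
print(result["frequency"].isnull().sum())  # products missing from transactions

# Absent products were simply never ordered, so zero is the right fill
result["frequency"] = result["frequency"].fillna(0)
```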


--> Question 2: Analyzing purchase activity of customers for each day in a week.

Answering this question will show us the most active days within a week.

Given the graph above, we can infer that days 0 and 1 are the weekend: 0 is Saturday and 1 is Sunday. Both have the highest, and almost identical, activity levels, while days 2 through 6 show fairly similar, lower activity.
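The counts behind that graph can be sketched with `value_counts` on `order_dow` (0-6, as in the Instacart schema); the frame is an illustrative stand-in for the orders table:

```python
import pandas as pd

# Stand-in for orders.csv with its order_dow column (0 = Saturday here)
orders = pd.DataFrame({"order_dow": [0, 0, 1, 1, 0, 3, 5]})

# Orders per day of week, in day order
dow_counts = orders["order_dow"].value_counts().sort_index()
print(dow_counts)
# dow_counts.plot(kind="bar") would reproduce the graph discussed above
```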


--> Question 3: Analyzing purchase activity of customers for each hour in a day.

For Day 0

For Day 1

For Day 2

For Day 3

For Day 4

For Day 5

For Day 6
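The seven per-day views above can be produced from one grouped count over `order_dow` and `order_hour_of_day`, then filtered per day. A minimal sketch with an illustrative stand-in for the orders table:

```python
import pandas as pd

# Stand-in for orders.csv with day-of-week and hour-of-day columns
orders = pd.DataFrame({
    "order_dow":         [0, 0, 0, 1, 1],
    "order_hour_of_day": [9, 9, 14, 10, 10],
})

# Orders per (day, hour) pair
hourly = (orders.groupby(["order_dow", "order_hour_of_day"])
                .size().reset_index(name="count"))
print(hourly)
# e.g. hourly[hourly["order_dow"] == 0] gives the "For Day 0" view
```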


--> Question 4: Frequency / Count of reordered products.

There seem to be a lot of products missing. This may be because these products were filtered out, as they were never reordered. Let's confirm this speculation.

Let's get the product names for the corresponding product IDs and save the dataframe as a CSV file.
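The reorder count and name lookup can be sketched by summing the `reordered` flag per product, dropping never-reordered products (which explains why they are missing from the result), and merging in names. Stand-in frames again:

```python
import pandas as pd

# Stand-ins for the transactional data and the product catalogue
order_products = pd.DataFrame({
    "product_id": [1, 1, 2, 3],
    "reordered":  [1, 1, 0, 1],   # 1 = this line was a reorder
})
products = pd.DataFrame({
    "product_id":   [1, 2, 3],
    "product_name": ["Banana", "Milk", "Yogurt"],
})

# Reorders per product; products never reordered drop out of the result
reorder_counts = (order_products.groupby("product_id")["reordered"].sum()
                                .reset_index(name="reorder_count"))
reorder_counts = reorder_counts[reorder_counts["reorder_count"] > 0]

# Attach names; to_csv would persist the result as described in the text
result = reorder_counts.merge(products, on="product_id")
# result.to_csv("reordered_products.csv", index=False)
print(result)
```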


--> Question 5: Identify potential products that are reordered after a gap of a few days

Answering this question helps us identify products that are reordered within a short period of time. Identifying them will help the retailer stock up on these products and avoid heavy losses.

There are two peaks in the graph: the first at 7 and the second at 30. This means that, the majority of the time, people tend to place their next order after 7 days or after 30 days.


We will now identify products that are ordered after a gap of 7 days.

First, we collected all the order IDs placed after a 7-day gap.


Next, we created a new dataframe consisting of the transactional data for only those extracted order IDs.


Finally, we built a dataframe of product frequencies, added product names, aisle IDs, and department IDs, and saved it to a CSV file.
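The three steps above can be sketched end to end; the frames are illustrative stand-ins for the real tables, and `days_since_prior_order` follows the Instacart schema:

```python
import pandas as pd

# Stand-ins for orders.csv, order_products__prior.csv, and products.csv
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "days_since_prior_order": [7.0, 30.0, 7.0],
})
order_products = pd.DataFrame({
    "order_id":   [1, 1, 2, 3],
    "product_id": [10, 11, 10, 10],
})
products = pd.DataFrame({
    "product_id":    [10, 11],
    "product_name":  ["Milk", "Eggs"],
    "aisle_id":      [84, 86],
    "department_id": [16, 16],
})

# Step 1: order IDs placed exactly 7 days after the previous order
gap7_ids = orders.loc[orders["days_since_prior_order"] == 7, "order_id"]

# Step 2: restrict the transactional data to those orders
gap7_tx = order_products[order_products["order_id"].isin(gap7_ids)]

# Step 3: per-product frequency, enriched with names, aisle and department IDs
freq = (gap7_tx.groupby("product_id").size().reset_index(name="frequency")
               .merge(products, on="product_id"))
# freq.to_csv("products_7_day_gap.csv", index=False)  # persist as in the text
print(freq)
```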